k8s-monitoring-helm's Introduction

Kubernetes Monitoring Helm Charts

Maintainers

Name Email Url
petewall [email protected]
skl [email protected]

Usage

Helm must be installed to use the chart. Please refer to Helm's documentation to get started.

Once Helm is set up properly, add the repo as follows:

helm repo add grafana https://grafana.github.io/helm-charts

See the Chart Documentation for chart install instructions.

Contributing

See our Contributing Guide for more information.

k8s-monitoring-helm's People

Contributors

aptomaketil, basvdl, bytheway, caleb-devops, cedricziel, claudioscalzo, cornelius-keller, dependabot[bot], dmouse, duncan485, eamonryan, github-actions[bot], gouthamve, hbjydev, iamdmitrij, ilian, jewbetcha, leszekblazewski, mar4uk, mattsimonsen, n888, peterolivo, petewall, qclaogui, roya, seamusgrafana, selyx, skl, t00mas, vad1mo

k8s-monitoring-helm's Issues

GitHub release action is not great for collaboration

The current GitHub action where we release on commits to main is nice for automation, but it has three implications:

  1. All community PRs need to know to bump the version and re-run helm-docs.
  2. Commits without a version bump will overwrite the current release.
  3. It's cumbersome to batch commits into a single release; you need to merge to a staging or development branch, then merge to main.

I propose we turn off auto-releasing and instead add a manually triggered GitHub action to release.
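A sketch of what that manual trigger could look like, assuming a hypothetical scripts/release.sh entry point (the workflow name, input, and script are illustrative, not the repo's actual release tooling):

name: Release charts (manual)
on:
  workflow_dispatch:   # run only when a maintainer triggers it from the Actions tab
    inputs:
      dry_run:
        description: "Validate and package without publishing"
        required: false
        default: "false"

jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Package and publish charts
        # hypothetical script; the real release steps would live here
        run: ./scripts/release.sh
        env:
          DRY_RUN: ${{ inputs.dry_run }}

With a workflow_dispatch trigger, several merged PRs can be batched into one release, and commits without a version bump never overwrite an existing release.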

Agent isn't setting __replica__ label by default

Having recently switched to this helm chart as we're building out a Mimir setup, I think I've run into an issue.

When setting the agent StatefulSet to run multiple replicas for HA mode, I get lots of rejected duplicate samples in the logs, and the Mimir HA dashboard shows it's not recognised as an HA agent setup.

I believe Mimir by default expects the replica label to be set on all incoming metrics but that this isn't being set by default in this chart.

I did try setting __replica__: ${POD_NAME} in the externalLabels section of the chart, but this didn't seem to register correctly (it ended up sending ${POD_NAME} as the replica name which of course didn't work).

Am I missing something in the chart values for this to be set up correctly? I think it should be the default for the agent when running in HA mode.
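For reference, the agent's flow configuration can resolve the pod name itself. A sketch like the one below would produce a real per-replica value, assuming a POD_NAME environment variable is injected into the agent pods via the downward API and a remote_write component named as in the chart's generated config:

prometheus.remote_write "grafana_cloud_prometheus" {
  endpoint {
    url = "https://<prometheus host>/api/prom/push"
  }

  // env() is evaluated by the agent at runtime, unlike a literal ${POD_NAME}
  // written into the chart's externalLabels values.
  external_labels = {
    "__replica__" = env("POD_NAME")
  }
}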

Add more flexibility in label matching

For example, Node Exporter always matches the label:
app.kubernetes.io/name=prometheus-node-exporter.*

But on OpenShift, it's just app.kubernetes.io/name=node-exporter

Perhaps, each metric source could have a labelMatcher section with defaults like:

metrics:
  node-exporter:
    labelMatchers:
    - key: app.kubernetes.io/name
      value: node-exporter

but, it would need to be overridable...

Force externalServices usernames and passwords to strings

If the username or password is numeric, YAML interprets it as a number, which then fails the templates:

Error: template: k8s-monitoring/templates/credentials.yaml:10:187: executing "k8s-monitoring/templates/credentials.yaml" at <b64enc>: wrong type for value; expected string; got float64

We should force those values to strings, no matter what.
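A sketch of the kind of template-side fix, assuming the credentials template b64-encodes the values directly (illustrative, not the exact template in the chart):

# templates/credentials.yaml (sketch): coerce to string before encoding, so a
# numeric username/password no longer trips the "expected string; got float64"
# error. Note: very large integers parsed as float64 can still render in
# scientific notation, so quoting the values in values.yaml remains safest.
username: {{ .Values.externalServices.prometheus.basicAuth.username | toString | b64enc }}
password: {{ .Values.externalServices.prometheus.basicAuth.password | toString | b64enc }}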

Add log level to pod logs

Suggestion on slack:

I was missing the colorisation of logs from my pods in Loki Explore mode. Adding the following to the Grafana Agent logs ConfigMap fixed it.
Instead of:

pipeline_stages:
            - docker: {}

I have:

pipeline_stages:
            - docker: {}
            - regex:
                expression: "(?P<level>[a-zA-Z]+):(.*)"
            - labels:
                level:

relabeler with cluster.name

I have many clusters which use the same Grafana Cloud instance, with similar things deployed to each. For instance, in my values file, I specify the following:

extraConfig: |-
  // Get all redis nodes
  prometheus.exporter.redis "redis" {
    redis_addr = "redis.redis:6379"
    namespace  = "redis"
    is_cluster = true
  }

  // Scrape redis nodes
  prometheus.scrape "redis" {
    targets    = prometheus.exporter.redis.redis.targets
    forward_to = [prometheus.relabel.redis.receiver]
  }

  // do some relabelling and forward metrics to grafana
  prometheus.relabel "redis" {
    forward_to = [prometheus.remote_write.grafana_cloud_prometheus.receiver]

    rule {
      source_labels = ["__name__"]
      regex         = "redis_blocked_clients|redis_cluster_slots_fail|redis_cluster_slots_pfail|redis_cluster_state|redis_commands_duration_seconds_total|redis_commands_total|redis_connected_clients|redis_connected_slaves|redis_db_keys|redis_db_keys_expiring|redis_evicted_keys_total|redis_keyspace_hits_total|redis_keyspace_misses_total|redis_master_last_io_seconds_ago|redis_memory_fragmentation_ratio|redis_memory_max_bytes|redis_memory_used_bytes|redis_memory_used_rss_bytes|redis_total_system_memory_bytes|redis_up"
      action        = "keep"
    }

    rule {
      source_labels = ["instance"]
      action = "replace"
      replacement = "development"
      target_label = "instance"
    }
  }

It would be neat to have a consistent prometheus.relabel configuration which could change the instance label to the cluster.name. I'm trying to reduce duplication where applicable, so that I can have one highly re-usable configuration whilst ensuring that the data sent up to Grafana has the relevant instance label for the relevant dashboards.

Ideally, I could change the above to something like:

prometheus.relabel "redis" {
    // Changed to the "rename instance label and ship"
    forward_to = [prometheus.relabel.instance.receiver]

    rule {
      source_labels = ["__name__"]
      regex         = "redis_blocked_clients|redis_cluster_slots_fail|redis_cluster_slots_pfail|redis_cluster_state|redis_commands_duration_seconds_total|redis_commands_total|redis_connected_clients|redis_connected_slaves|redis_db_keys|redis_db_keys_expiring|redis_evicted_keys_total|redis_keyspace_hits_total|redis_keyspace_misses_total|redis_master_last_io_seconds_ago|redis_memory_fragmentation_ratio|redis_memory_max_bytes|redis_memory_used_bytes|redis_memory_used_rss_bytes|redis_total_system_memory_bytes|redis_up"
      action        = "keep"
    }
  }

I only see this working once per cluster, although it would also be useful to make the instance label more distinguishable for those who deploy multiple instances of the same thing to one cluster:

  • one instance of a thing -> instance_label: <cluster.name>
  • multiple instances of a thing -> instance_label: <cluster.name>-something <-- I have no idea what "something" would be, but perhaps the name of the thing that called the relabel receiver (I don't know the ins and outs of flow mode yet, so I'm still getting to understand it)
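A minimal sketch of the shared component the issue imagines, assuming the cluster name is available to the generated config and that extraConfig pipelines forward to it instead of straight to remote_write (component and label values are illustrative):

// One shared relabel component: rewrite the instance label to the cluster
// name, then ship to Grafana Cloud.
prometheus.relabel "instance" {
  forward_to = [prometheus.remote_write.grafana_cloud_prometheus.receiver]

  rule {
    action       = "replace"
    target_label = "instance"
    // ideally templated from .Values.cluster.name rather than hard-coded
    replacement  = "my-cluster"
  }
}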

Allow basicAuth username/password to be sourced from secrets

We use Terraform to control the configuration of this helm chart, which means we have to check the raw username and password values into our Terraform source code; this is a very bad practice. It would be much safer and cleaner if we had the option of using values from regular Kubernetes Secrets, which can be managed in a much more secure way.
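A sketch of what the values interface might look like; the key names here (existingSecret, usernameKey, passwordKey) are hypothetical and only illustrate the shape of the request, not the chart's actual API:

externalServices:
  prometheus:
    host: "https://prometheus-prod-XX.grafana.net"
    basicAuth:
      # Hypothetical: read credentials from a pre-existing Kubernetes Secret
      # instead of inlining them in values (and therefore in Terraform).
      existingSecret: "grafana-cloud-metrics"
      usernameKey: "username"
      passwordKey: "password"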

Pod logs are missing certain fields

When processing logs, the static config sets these fields, while the flow config in this chart does not:

  • namespace, from __meta_kubernetes_namespace
  • pod, from __meta_kubernetes_pod_name
  • container, from __meta_kubernetes_pod_container_name

Also, the job field is set to <namespace>/<pod>, which we should match.
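A minimal sketch of the relabel rules that would add those fields, assuming the chart's pod-log pipeline relabels pod targets before tailing them (component names are illustrative):

discovery.relabel "pod_logs" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
  }
  // Match the static config's job label of <namespace>/<pod>
  rule {
    source_labels = ["__meta_kubernetes_namespace", "__meta_kubernetes_pod_name"]
    separator     = "/"
    target_label  = "job"
  }
}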

Add scrape intervals to each metric?

I've heard that there was confusion over the default scrape interval being used when scraping metrics.

The default is 60s, which is set in prometheus.scrape. However, a user not familiar with flow might not know that is the default. They wouldn't necessarily know to dig down into agent documentation.

If we set:

node-exporter:
  scrapeInterval: 60s

And use that in the generated config, it would make it explicit.
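In the generated config that value would simply be passed through to the scrape component, for example (a sketch; component names assumed):

prometheus.scrape "node_exporter" {
  targets         = discovery.relabel.node_exporter.output
  forward_to      = [prometheus.relabel.node_exporter.receiver]
  // Explicit, even though 60s is already the prometheus.scrape default
  scrape_interval = "60s"
}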

HTTP status 404 Not Found: 404 page not found

Hi

I am getting a 400 error for my Pushgateway from the Grafana Agent, but when I write sample data through a curl command it works.

ts=2023-08-30T19:16:22.239618449Z component=prometheus.remote_write.grafana_cloud_prometheus subcomponent=rw level=error remote_name=8117c6 url=https://XXXXXX:XXXXXXX%[email protected]/metrics/job/some_job msg="non-recoverable error" count=564 exemplarCount=0 err="server returned HTTP status 400 Bad Request: snappy: corrupt input"

Incorrect logs parsing

Heyo,

I've been using v0.0.15 of this chart and everything was working OK, both logs and metrics.
After upgrading (I don't remember the exact version, but >v0.1.x) I started receiving strange logs.

Logs look now like that:

<time> <output stream> F <log_line>
2023-08-29T19:19:00.559181246Z stdout F {"jsonLogKey": "value"}

which breaks JSON parsing in Loki.

Tested with the latest version (0.1.15) and the problem still exists.
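For reference, the extra fields are the containerd/CRI log wire format (timestamp, output stream, flags, content). A minimal sketch of handling it in the chart's flow pipeline, assuming a loki.process component sits in front of loki.write (names illustrative):

loki.process "pod_logs" {
  forward_to = [loki.write.grafana_cloud_loki.receiver]

  // Strip the CRI envelope ("<time> <stream> F <log line>") so only the
  // original log line reaches Loki and JSON parsing works again.
  stage.cri {}
}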

Move the cluster label to discovery.relabel

prometheus.relabel can be expensive (less so for keep|drop operations), so a call that just sets the cluster label should be moved to the discovery.relabel components for each metric source.

Things to consider:

  • Does this affect how ServiceMonitor and PodMonitor objects set the cluster label?
  • Should we have a section in the docs about best practice for adding scrape targets?
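A sketch of the shape this would take for one metric source, assuming a static cluster name (the real chart would template it in):

discovery.relabel "kubelet" {
  targets = discovery.kubernetes.nodes.targets

  // Setting the cluster label here happens once per target at discovery
  // time, instead of once per sample inside prometheus.relabel.
  rule {
    action       = "replace"
    target_label = "cluster"
    replacement  = "my-cluster"
  }
}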

Allow for pod logs to use custom discovery list

Currently, there isn't a clean way to get pod logs from a subset of pods (e.g. only from a certain set of namespaces).

We should make a way to use a different discovery component.

Perhaps by giving a custom component:

logs:
  pod_logs:
    target: "discovery.kubernetes.my_pods_for_logs.targets"

extraConfig: |-
  discovery.kubernetes "my_pods_for_logs" {
    role = "pod"
    namespaces {
      names = ["mynamespace"]
    }
  }

`loki.source.kubernetes` is not usable

There's a bug in loki.source.kubernetes where it stops tailing logs after log file rotation. It's actually an issue with the underlying Kubernetes API, with a PR in process, but that will likely take some time and will only work on new enough versions of Kubernetes.

We should move off of loki.source.kubernetes and on to loki.source.file, which requires the Grafana Agent to run in DaemonSet mode so that each node can read its own pod log files.
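A minimal sketch of the file-based pipeline, assuming the agent runs as a DaemonSet with /var/log/pods mounted from the host (component names are illustrative):

discovery.kubernetes "pods" {
  role = "pod"
}

// Map each discovered pod container to its log file path on the node
discovery.relabel "pod_logs" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
    separator     = "/"
    target_label  = "__path__"
    replacement   = "/var/log/pods/*$1/*.log"
  }
}

local.file_match "pod_logs" {
  path_targets = discovery.relabel.pod_logs.output
}

loki.source.file "pod_logs" {
  targets    = local.file_match.pod_logs.targets
  forward_to = [loki.write.grafana_cloud_loki.receiver]
}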

Add a post-install hook to validate the configuration

This would be a Job that runs grafana-agent fmt to check that the configuration is syntactically valid. This would be useful, especially when including .extraConfig, to catch issues at helm install time, rather than waiting for grafana-agent pods to crash.
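A rough sketch of such a hook, assuming the generated config lives in a release-named ConfigMap under a config.river key (names and image tag are illustrative; post-upgrade is included on the assumption that upgrades should be validated too):

apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-validate-config
  annotations:
    # Run after install/upgrade; a non-zero exit fails the release
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: validate
          image: grafana/agent:v0.36.1   # illustrative tag
          command: ["grafana-agent", "fmt", "/etc/agent/config.river"]
          volumeMounts:
            - name: config
              mountPath: /etc/agent
      volumes:
        - name: config
          configMap:
            name: {{ .Release.Name }}-config   # illustrative ConfigMap name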

Loki TenantID should not be treated as a secret

The new tenant id field is loaded as a secret.
loki.write has tenant_id as a string type. You cannot put a secret into a string.

So, either we stop treating tenant id as a secret, or we wrap it in [nonsensitive()](https://grafana.com/docs/agent/latest/flow/reference/stdlib/nonsensitive/).

We should look at how the prometheus tenant id is set, too.
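A minimal sketch of the second option, assuming the tenant ID is read from the chart-managed secret via a remote.kubernetes.secret component (component and key names are illustrative):

loki.write "grafana_cloud_loki" {
  endpoint {
    url = "https://<loki host>/loki/api/v1/push"
    // Values from the secret are the "secret" type; nonsensitive() converts
    // back to a plain string so it can be assigned to tenant_id.
    tenant_id = nonsensitive(remote.kubernetes.secret.logs_service.data["tenantId"])
  }
}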

Need a document: "Scraping metrics from my service"

We should add a document that answers the question "I want to scrape metrics from my pod or service" and walks step-by-step through doing that. We have a section in the main README for adding custom flow config, but that covers only the technical side of the problem, and isn't really helpful if the user isn't familiar with flow components.

The structure of the doc could look like:

  • Introduction
    • If you have an application deployed on Kubernetes with Prometheus metrics...
  • Discovery
    • Is your app running as a pod?
    • Is your app running as a service?
    • Filtering to the right entity and metrics port
  • Scraping
  • Post-processing
    • Filtering the number of metrics
    • Adjusting metrics
  • Sending to prometheus
  • Adding the config to the helm chart
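The worked example in such a document might look like this end to end (a sketch only, assuming a service named my-app in namespace my-namespace exposing metrics on a port named http-metrics):

extraConfig: |-
  // Discovery: find the service's endpoints
  discovery.kubernetes "my_app" {
    role = "endpoints"
    namespaces {
      names = ["my-namespace"]
    }
  }

  // Filtering: keep only the my-app service and its metrics port
  discovery.relabel "my_app" {
    targets = discovery.kubernetes.my_app.targets
    rule {
      source_labels = ["__meta_kubernetes_service_name", "__meta_kubernetes_endpoint_port_name"]
      separator     = "/"
      regex         = "my-app/http-metrics"
      action        = "keep"
    }
  }

  // Scraping and sending: forward to the chart's existing remote_write
  prometheus.scrape "my_app" {
    targets    = discovery.relabel.my_app.output
    forward_to = [prometheus.remote_write.grafana_cloud_prometheus.receiver]
  }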

Inline cluster name

I admit, I was being cute when I put the cluster name in the config map and then mounted that into the agent container.

I think the original idea was to set it in one place, but the cluster name really shouldn't change much.

The cluster name being piggy-backed into the same ConfigMap as the Agent configuration means that it's harder for users to use their own ConfigMaps.

Add option to define write_relabel_config block

The feature that I am missing is to have option to define this block: https://grafana.com/docs/agent/latest/flow/reference/components/prometheus.remote_write/#write_relabel_config-block

My main reason behind this is that I want to be in full control to drop any metrics from any source or service monitor from a single place. So something like this:

writeRelabelConfig: |-
    rule {
      source_labels = ["__name__"]
      regex = "metric_name|another_metric_to_drop"
      action = "drop"
    }

To my knowledge this option should be added to this config file: https://github.com/grafana/k8s-monitoring-helm/blob/main/charts/k8s-monitoring/templates/agent_config/_prometheus.river.txt#L23 and it needs to be configurable with helm values, similar to how you added it in this PR: #92
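In the generated flow config, the value would end up inside the remote_write endpoint, roughly like this (a sketch; component name as used elsewhere in the chart's examples):

prometheus.remote_write "grafana_cloud_prometheus" {
  endpoint {
    url = "https://<prometheus host>/api/prom/push"

    // One write_relabel_config block per rule; applied to every series right
    // before it is sent, regardless of which scrape produced it.
    write_relabel_config {
      source_labels = ["__name__"]
      regex         = "metric_name|another_metric_to_drop"
      action        = "drop"
    }
  }
}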

Add sane defaults to reduce cardinality

The current config will generate close to 10,000 metrics on a nearly empty cluster.

Much of the problem is that something like the cAdvisor scrape for networking will generate a series per pod per interface per device per xxx, easily generating 1,000-2,000 unique metrics.

The reality is that by far the majority of these unique label combinations are never used.
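One possible shape for such a default, shown as a sketch only (a drop rule applied in the cAdvisor pipeline; the metric names below are illustrative examples of high-cardinality network series, not a vetted list):

prometheus.relabel "cadvisor" {
  forward_to = [prometheus.remote_write.grafana_cloud_prometheus.receiver]

  // Drop per-interface network series that are rarely used in dashboards.
  rule {
    source_labels = ["__name__"]
    regex         = "container_network_tcp_usage_total|container_network_udp_usage_total"
    action        = "drop"
  }
}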

Add support for loki.source.podlogs

Adding support for PodLogs would be another place where existing integrations would "just work" and not have to be converted to flow.

Implementation would likely follow a similar flow to the ServiceMonitor and PodMonitor paths.
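A minimal sketch of what the generated config might gain, assuming only a forward_to is required and the selectors stay at their defaults so every PodLogs resource in the cluster is honoured:

// Discover PodLogs custom resources and tail the pods they select,
// analogous to how ServiceMonitor/PodMonitor objects are handled for metrics.
loki.source.podlogs "pod_logs_objects" {
  forward_to = [loki.write.grafana_cloud_loki.receiver]
}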

Error: template: k8s-monitoring/templates/grafana-agent-logs-config.yaml:5:45: executing "k8s-monitoring/templates/grafana-agent-logs-config.yaml" at <index .Subcharts "grafana-agent-logs">: error calling index: index of untyped nil

Cannot install the chart

Client Version: v1.25.0
Kustomize Version: v4.5.7
Server Version: v1.24.3+k3s1
helm upgrade --kube-context default --install grafana-k8s-monitoring grafana/k8s-monitoring \
    --namespace "monitoring" --create-namespace --values - <<EOF
cluster:
  name: "default"

externalServices:
  prometheus:
    host: "https://prometheus-prod-05-gb-south-0.grafana.net"
    basicAuth:
      username: "1097405"
      password: "xxxxx"

metrics:
  cost: {enabled: false}

opencost: {enabled: false}
EOF
Release "grafana-k8s-monitoring" does not exist. Installing it now.
Error: template: k8s-monitoring/templates/grafana-agent-logs-config.yaml:5:45: executing "k8s-monitoring/templates/grafana-agent-logs-config.yaml" at <index .Subcharts "grafana-agent-logs">: error calling index: index of untyped nil

Kubelet and cAdvisor don't work if the default cluster domain is not used

If the cluster does not use cluster.local, then the kubelet and cAdvisor metric sources will not work correctly.

Both metric sources work by discovering all nodes, then changing the target address and path to:

  • __address__: kubernetes.default.svc.cluster.local:443 (The kubernetes service in the default namespace)
  • __path__: /api/v1/nodes/<node>/proxy/metrics (A path for that service that references the node)

So, while we discover the nodes, we hard-code the kubernetes service.

If the user's cluster's internal DNS uses something like kubernetes.default.svc.my-prod-cluster.local:443, the hard-coded service address will be wrong.
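For reference, the relabeling in question looks roughly like this (a sketch; the path is presumably set via __metrics_path__, the Prometheus path label, and the cluster.local suffix is the part that would need to be configurable):

discovery.relabel "kubelet" {
  targets = discovery.kubernetes.nodes.targets

  // Every node target is rewritten to go through the API server proxy,
  // with a hard-coded cluster.local suffix...
  rule {
    target_label = "__address__"
    replacement  = "kubernetes.default.svc.cluster.local:443"
  }

  // ...while the node name is preserved in the proxy path.
  rule {
    source_labels = ["__meta_kubernetes_node_name"]
    target_label  = "__metrics_path__"
    replacement   = "/api/v1/nodes/$1/proxy/metrics"
  }
}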

Add ability to use existing ConfigMap

If a user has a highly customized config, they may want to use this chart to deploy the tools, but not deploy the config map.

Add the ability to use an existing ConfigMap. Potential values file:

configMap:
  create: true
